Tree Hash EXchange format (THEX) TOC 

            J. Chapweske
             Onion Networks, Inc.
             G. Mohr
             Bitzi, Inc.
             March 4, 2003



Tree Hash EXchange format (THEX)
Abstract
This memo presents the Tree Hash Exchange (THEX) format, for exchanging Merkle 
Hash Trees built up from the subrange hashes of discrete digital files. Such 
tree hash data structures assist in file integrity verification, allowing 
arbitrary subranges of bytes to be verified before the entire file has been 
received. 



       TOC 

Table of Contents
  1.  Introduction
  2.  Merkle Hash Trees
  2.1  Hash Functions
  2.2  Unbalanced Trees
  2.3  Choice Of Segment Size
  3.  Serialization Format
  3.1  DIME Encapsulation
  3.2  XML Tree Description
  3.2.1  File Size
  3.2.2  File Segment Size
  3.2.3  Digest Algorithm
  3.2.4  Digest Output Size
  3.2.5  Serialized Tree Depth
  3.2.6  Serialized Tree Type
  3.2.7  Serialized Tree URI
  3.3  Breadth-First Serialization
  3.3.1  Serialization Type URI
    Authors' Addresses
  A.  Test Vectors





       TOC 

1. Introduction
The Merkle Hash Tree, invented by Ralph Merkle, is a hash construct that 
exhibits desirable properties for verifying the integrity of files and file 
subranges in an incremental or out-of-order fashion. This document describes a 
binary serialization format for hash trees that is compact and optimized for 
both sequential and random access. This memo has two goals: 
  To describe Merkle Hash Trees and how they are used for file integrity 
  verification. 
  To describe a serialization format for storage and transmission of hash trees. 




       TOC 

2. Merkle Hash Trees
It is common practice in distributed systems to use secure hash algorithms to 
verify the integrity of content. The employment of secure hash algorithms 
enables systems to retreive content from completely untrusted hosts with only a 
small amount of trusted metadata. 
Typically, algorithms such as SHA-1 and MD5 have been used to check the content 
integrity after retrieving the entire file. These full file hash techniques work 
fine in an environment where the content is received from a single host and 
there are no streaming requirements. However, there are an increasing number of 
systems that retrieve a single piece of content from multiple untrusted hosts, 
and require content verification well in advance of retrieving the entire file. 
Many modern peer-to-peer content delivery systems employ fixed size "block 
hashes" to provide a finer level of granularity in their integrity checking. 
This approach is still limited in the verification resolution it can attain. 
Additionally, all of the hash information must be retrieved from a trusted host, 
which can limit the scalability and reliability of the system. 
Another way to verify content is to use the hash tree approach. This approach 
has the desired characteristics missing from the full file hash approach and 
works well for very large files. The idea is to break the file up into a number 
of small pieces, hash those pieces, and then iteratively combine and rehash the 
resulting hashes in a tree-like fashion until a single "root hash" is created. 
The root hash by itself behaves exactly the same way that full file hashes do. 
If the root hash is retrieved from a trusted source, it can be used to verify 
the integrity of the entire content. More importantly, the root hash can be 
combined with a small number of other hashes to verify the integrity of any of 
the file segments. 
For example, consider a file made up of four segments, S1, S2, S3, and S4. Let 
H() be the hash function, and '+' indicate concatenation. You could take the 
traditional hash value: 
    VALUE=H(S1+S2+S3+S4)
Or, you could employ a tree approach. The tree hash utilizes two hash algorithms 
- one for leaf hashes and one for internal hashes. Let LH() be the leaf hash 
function and IH() be the internal hash function: 

               ROOT=IH(E+F)
                /      \
               /        \
        E=IH(A+B)       F=IH(C+D)
        /     \           /    \
       /       \         /      \
  A=LH(S1)  B=LH(S2) C=LH(S3)  D=LH(S4)

Now, assuming that the ROOT is retrieved from a trusted source, the integrity of 
a file segment coming from an untrusted source can be checked with a small 
amount of hash data. For instance, if S1 is received from an untrusted host, the 
integrity of S1 can be verified with just B and F. With these, it can be 
verified that, yes: S1 can be combined up to equal the ROOT hash, even without 
seeing the other segments. (It is just as impractical to create falsified values 
of B and F as it is to manipulate any good hash function to give desired results 
-- so B and F can come from untrusted sources as well.) Similarly, if some other 
untrusted source provides segments S3 and S4, their integrity can be easily 
checked when combined with hash E. From segments S3 and S4, the values of C and 
D and then F can be calculated. With these, you can verify that S3 and S4 can 
combine up to create the ROOT -- even if other sources are providing bogus S1 
and S2 segments. Bad info can be immediately recognized and discarded, and good 
info retained, even in situations where you could not even begin to calculate a 
traditional full-file hash. 
Another interesting property of the tree approach is that it can be used to 
verify (tree-aligned) subranges whose size is any multiple of the base segment 
size. 
Consider for example an initial segment size of 1,024 bytes, and a file of 32GB. 
You could verify a single 1,024-byte block, with about 25 proof-assist values, 
or a block of size 16GB, with a single proof-assist value -- or anything in 
between. 
2.1 Hash Functions
The strength of the hash tree construct is only as strong as the underlying hash 
algorithm. Thus, it is RECOMMENDED that a secure hash algorithm such as SHA-1 be 
used as the basis of the hash tree. 
In order to protect against collisions between leaf hashes and internal hashes, 
different hash constructs are used to hash the leaf nodes and the internal 
nodes. The same hash algorithm is used as the basis of each construct, but a 
single '1' byte in network byte order, or 0x01 is prepended to the input of the 
internal node hashes, and a single '0' byte, or 0x00 is prepended to the input 
of the leaf node hashes. 
Let H() be the secure hash algorithm, for example SHA-1. 
internal hash function = IH(X) = H(0x01, X)

leaf hash function = LH(X) = H(0x00, X)
2.2 Unbalanced Trees
For trees that are unbalanced -- that is, they have a number of leaves which is 
not a power of 2 -- interim hash values which do not have a sibling value to 
which they may be concatenated are promoted, unchanged, up the tree until a 
sibling is found. 
For example, consider a file made up of 5 segments, S1, S2, S3, S4, and S5. 
                       ROOT=IH(H+E)
                        /        \
                       /          \
                H=IH(F+G)          E
                /       \           \
               /         \           \
        F=IH(A+B)       G=IH(C+D)     E
        /     \           /     \      \
       /       \         /       \      \
  A=LH(S1)  B=LH(S2) C=LH(S3)  D=LH(S4) E=LH(S5)

In the above example, E does not have any immediate siblings with which to be 
combined to calculate the next generation. So, E is promoted up the tree, 
without being rehashed, until it can be paired with value H. The values H and E 
are then concatenated, and hashed, to produce the ROOT hash. 
2.3 Choice Of Segment Size
Any segment size is possible, but the choice of base segment size establishes 
the smallest possible unit of verification. 
If the the segment size is equal to or larger than the file to be hashed, the 
tree hash value is the value of the single segment's value, which is the same as 
the underlying hash algorithm value for the whole file. 
A segment size equal to the digest algorithm output size would more than double 
the total amount of data to be hashed, and thus more than double the time 
required to calculate the tree hash structure, as compared to a simple full-file 
hash. However, once the segment size reaches several multiples of the digest 
size, calculating the tree adds only a small fractional time overhead beyond 
what a traditional full-file hash would cost. 
Otherwise, smaller segments are better. Smaller segments allow, but do not 
require, the retention and use of fine-grained verification info, (A stack-based 
tree calculation procedure need never retain more than one pending internal node 
value per generation before it can be combined with a sibling, and all interim 
values below a certain generation size of interest can be discarded.) Further, 
it is beneficial for multiple application domains and even files of wildly 
different sizes to share the same base segment size, so that tree structures can 
be shared and used to discover correlated subranges. 
Thus the authors recommend a segment size of 1,024 bytes for most applications, 
as a sort of "smallest common denominator", even for applications involving 
multi-gigabyte or terabyte files. This segment size is 40-50 times larger than 
common secure hash digest lengths (20-24 bytes), and thus adds no more than 
5-10% in running time as compared to the "infinite segment" size case -- the 
traditional full-file hash. 
Considering a 1 terabyte file, the maximum dynamic state required during the 
calculation of the tree root value is 29 interim node values -- less than 1KB 
assuming a 20-byte digest algorithm like SHA-1. Only interim values in 
generations of interest for range verification need to be remembered for tree 
exchange, so if only 8GB ranges ever need to be verified, all but the top 8 
generations of internal values (255 hashes) can be discarded. 



       TOC 

3. Serialization Format
This section presents a serialization format for Merkle Hash Trees that utilizes 
the Direct Internet Message Encapsulation (DIME) format. DIME is a generic 
message format that allows for multiple payloads, either text or binary. The 
Merkle Hash Tree serialization format consists of two different payloads. The 
first is XML encoded meta-data about the hash tree, and the second is binary 
serialization of the tree itself. The binary serialization is required for two 
important reasons: 
  Compactness of Representation - A key virtue of the hash tree approach is that 
  it provides considerable integrity checking power with a relatively small 
  amount of data. A typical hash tree consists of a large number of small 
  hashes. Thus a text encoding, such as XML, could easily double the storage and 
  transmission requirements of the hash tree, negating one of its key benefits. 
  Random Access - In order to take full advantage of the hash tree construct, it 
  is often necessary to read the elements of the hash tree in a random access 
  fashion. A common usage of this serialization format will be to access hash 
  data over the HTTP protocol using "Range Requests". This will allow 
  implementors to retrieve small bits of hash information on-demand, even 
  requesting different parts of the tree from different hosts on the network. 
3.1 DIME Encapsulation
It is RECOMMENDED that DIME be used to encapsulate the payloads described in 
this specification. The current version of DIME is "draft-nielsen-dime-01" at 
(http://gotdotnet.com/team/xml_wsspecs/dime/default.aspx). 
It is RECOMMENDED that the first payload in the DIME Message be the XML Tree 
Description. The XML Tree description payload MUST be before the the binary 
serialized tree. 
It is RECOMMENDED that the binary serialized tree be stored in a single payload 
rather than using chunked payloads. This will allow implementations to read the 
tree hash data in a random access fashion within the payload. 
3.2 XML Tree Description
The XML Tree Description contains metadata about the hash tree and file that is 
necessary to interpret the binary serialized tree. An important consideration in 
the design of THEX is the intention for it to be received from untrusted sources 
within a distributed network. The only information that needs to be obtained 
from a trusted source is the root hash and the segment size. The root hash by 
itself can be used to verify the integrity of the serialized tree and of the 
file itself. 
It is RECOMMENDED that implementers assume that the serialized file was obtained 
from an untrusted source, thus the use of this format to store non-verifiable 
information, such as general file metadata, is highly discouraged. For instance, 
a malicious party could easily forge metadata, such as the author or file name. 
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE hashtree system "http://open-content.net/spec/thex/thex.dtd">
<hashtree>
    <file size='1146045066' segmentsize='1024'/>
    <digest algorithm='http://www.w3.org/2000/09/xmldsig#sha1' 
            outputsize='20'/>
    <serializedtree depth='22' 
                    type='http://open-content.net/spec/thex/breadthfirst' 
                    uri='uuid:09233523-345b-4351-b623-5dsf35sgs5d6'/>
</hashtree>
3.2.1 File Size
The file size attribute refers to the size, in bytes, of the file that the hash 
tree was generated from. 
3.2.2 File Segment Size
The file segment size identifies the size, in bytes, of the file segments that 
were used to create the hash tree. As noted in Choice Of Segment Size, it is 
recommended that applications use a small, common segment size such as 1,024 
bytes in order to retain maximum flexibility and interoperability. 
3.2.3 Digest Algorithm
This attribute provides the identifier URI for the digest algorithm. A URI is 
used here as an identifier instead of a regular string to avoid the overhead of 
IANA-style registration. By using URIs, new types can be created without having 
to consult any other entity. The URIs are only to be used for type 
identification purposes, but it is RECOMMENDED that the URIs point to 
information about the given digest function. This convention is inspired by RFC 
3275, the XML Signature Specification. For instance, the SHA-1 algorithm is 
identified by "http://www.w3.org/2000/09/xmldsig#sha1". All digest algorithms 
defined in RFC 3275 are supported. The Tiger algorithm is also supported and is 
identified by "http://open-content.net/spec/digest/tiger". 
3.2.4 Digest Output Size
This attribute specifies the size of the output of the hash function, in bytes. 
3.2.5 Serialized Tree Depth
This attribute specifies the number of levels of the tree that have been 
serialized. This value allows control over the amount of storage space required 
by the serialized tree. In general, each row added to the tree will double the 
storage requirements while also doubling the verification resolution. 
3.2.6 Serialized Tree Type
This attribute provides the identifier URI for the serialization type. Just as 
with the Digest Algorithm, new serialization types can be added and described 
without going through a formal IANA-style process. One serialization type is 
defined for "Breadth-First Serialization" later in this document. 
3.2.7 Serialized Tree URI
This attribute provides the URI of the binary serialized tree payload. If used 
within a DIME payload, it is recommended that this URI be location independant, 
such as the "uuid:" URI's used in the SOAP in DIME specification or SHA-1 URNs. 
3.3 Breadth-First Serialization
Normal breadth-first serialization is the recommended manner in which to 
serialize the hash tree. This format includes the root hash first, and then each 
"row" of hashes is serialized until the tree has been serialized to the lowest 
level as specified by the "Serialized Tree Depth" field. 
For example, consider a file made up of 5 segments, S1, S2, S3, S4, and S5. 
                       ROOT=IH(H+E)
                        /        \
                       /          \
                H=IH(F+G)          E
                /       \           \
               /         \           \
        F=IH(A+B)       G=IH(C+D)     E
        /     \           /     \      \
       /       \         /       \      \
  A=LH(S1)  B=LH(S2) C=LH(S3)  D=LH(S4) E=LH(S5)
The hashes would be serialized in the following order: ROOT, H, E, F, G, E, A, 
B, C, D, E. Notice that E is serialized as a part of the each row. This is due 
to its promotion as there are no available siblings in the lower rows. If we 
choose to serialize the entire tree, the serialized tree depth would be 4, and 
for a 20 byte digest output, the entire tree payload would occupy 11*20 = 220 
bytes. 
3.3.1 Serialization Type URI
The serialization type URI for a Merkle Hash Tree serialized in normal 
breadth-first form is "http://open-content.net/spec/thex/breadthfirst". 



       TOC 

Authors' Addresses
       Justin Chapweske
       Onion Networks, Inc.
       1668 Rosehill Circle
       Lauderdale, MN 55108
       US
      EMail: justin@onionnetworks.com
      URI: http://onionnetworks.com/
        
       Gordon Mohr
       Bitzi, Inc.
       
       
      EMail: gojomo@bitzi.com
      URI: http://bitzi.com/



       TOC 

Appendix A. Test Vectors
The following are test vectors for producing THEX hash trees using the Tiger 
hash algorithm. The 'urn:sha1' entries specify the full file SHA-1 of the data, 
while the 'urn:tree:tiger' entries specify the root of the THEX hash tree of the 
data. 
The empty (zero-length) file:

  urn:sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
  urn:tree:tiger:LWPNACQDBZRYXW3VHJVCJ64QBZNGHOHHHZWCLNQ
  
A file with a single zero byte:

  urn:sha1:LOUTZHNQZ74T6UVVEHLUEDSD63W2E6CP
  urn:tree:tiger:VK54ZIEEVTWNAUI5D5RDFIL37LX2IQNSTAXFKSA

A file with 1024 'A' characters:

  urn:sha1:ORWD6TJINRJR4BS6RL3W4CWAQ2EDDRVU
  urn:tree:tiger:L66Q4YVNAFWVS23X2HJIRA5ZJ7WXR3F26RSASFA
  
A file with 1025 'A' characters:

  urn:sha1:UUHHSQPHQXN5X6EMYK6CD7IJ7BHZTE77
  urn:tree:tiger:PZMRYHGY6LTBEH63ZWAHDORHSYTLO4LEFUIKHWY
